时空动作检测是视频理解的重要组成部分。当前的时空动作检测方法将首先使用对象检测器获得人候选建议。然后,该模型将将候选人分为不同的行动类别。所谓的两阶段方法很重,很难在现实世界应用中应用。一些现有的方法使用统一的模型结构,但它们使用香草模型的性能不佳,并且通常需要额外的模块来提高性能。在本文中,我们探讨了建立端到端时空动作探测器的策略,其修改最少。为此,我们提出了一种名为ME-STAD的新方法,该方法以端到端的方式解决了空间 - 周期性动作检测问题。除模型设计外,我们还提出了一种新颖的标签策略,以处理空间数据集中的稀疏注释。提出的ME-STAD比原始的两阶段探测器和减少80%的FLOPS取得更好的结果(2.2%的MAP增强)。此外,我们提出的我的stad仅具有先前方法的最小修改,并且不需要额外的组件。我们的代码将公开。
translated by 谷歌翻译
在联邦学习方案中,多方共同从其各自的数据中学习模型,有两个相互矛盾的目标是选择适当的算法。一方面,必须在存在\ textit {semi-honest}合作伙伴的情况下尽可能保持私人和敏感的培训数据,而另一方面,必须在不同方之间交换一定数量的信息学习实用程序。这样的挑战要求采用隐私的联合学习解决方案,该解决方案最大程度地提高了学习模型的效用,并维护参与各方的私人数据的可证明的隐私保证。本文说明了一个一般框架,即a)从统一信息理论的角度来制定隐私损失和效用损失之间的权衡,而b)在包括随机化,包括随机性,包括随机的机制,包括随机性,,包括随机性,,包括随机性,,包括随机性,,包括随机性,,包括随机性,,包括随机性,,包括随机性,包括随机性,,使用稀疏性和同态加密。结果表明,一般而言\ textit {没有免费的午餐来进行隐私 - 私人权衡取舍},并且必须用一定程度的降级效用进行保存隐私。本文中说明的定量分析可以作为实用联合学习算法设计的指导。
translated by 谷歌翻译
联合学习模型是根据多方拥有的宝贵培训数据进行协作开发的。在联合模型的开发和部署过程中,它们会面临风险,包括非法复制,重新分配,滥用和/或自由骑行。为了解决这些风险,联合学习模型的所有权验证是保护联合学习模型知识产权(IPR)(即Fedipr)的先决条件。我们提出了一种新颖的联邦深神经网络(FedDNN)所有权验证计划,该计划允许将专用水印嵌入并进行验证,以声称是FedDNN模型的合法IPR。在拟议的计划中,每个客户都独立验证了模型水印的存在,并索赔联合模型的所有权,而没有透露私人培训数据也没有私人水印信息。从理论上讲,嵌入式水印的有效性是通过对多个客户私下嵌入并检测到的水印的严格分析来证明的。此外,关于计算机视觉和自然语言处理任务的广泛实验结果表明,可以嵌入并可靠地检测到不同的位水印,而不会损害原始模型性能。我们的水印方案还具有各种联合训练环境的弹性,并防止拆除攻击。
translated by 谷歌翻译
Large training data and expensive model tweaking are standard features of deep learning for images. As a result, data owners often utilize cloud resources to develop large-scale complex models, which raises privacy concerns. Existing solutions are either too expensive to be practical or do not sufficiently protect the confidentiality of data and models. In this paper, we study and compare novel \emph{image disguising} mechanisms, DisguisedNets and InstaHide, aiming to achieve a better trade-off among the level of protection for outsourced DNN model training, the expenses, and the utility of data. DisguisedNets are novel combinations of image blocktization, block-level random permutation, and two block-level secure transformations: random multidimensional projection (RMT) and AES pixel-level encryption (AES). InstaHide is an image mixup and random pixel flipping technique \cite{huang20}. We have analyzed and evaluated them under a multi-level threat model. RMT provides a better security guarantee than InstaHide, under the Level-1 adversarial knowledge with well-preserved model quality. In contrast, AES provides a security guarantee under the Level-2 adversarial knowledge, but it may affect model quality more. The unique features of image disguising also help us to protect models from model-targeted attacks. We have done an extensive experimental evaluation to understand how these methods work in different settings for different datasets.
translated by 谷歌翻译
A storyboard is a roadmap for video creation which consists of shot-by-shot images to visualize key plots in a text synopsis. Creating video storyboards however remains challenging which not only requires association between high-level texts and images, but also demands for long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS) which aims to retrieve an ordered sequence of images to visualize the text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset. It contains 10K text synopses each paired with keyframes that are manually selected from corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence in long-term shots, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models to create text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance suggesting room for promising future work.
translated by 谷歌翻译
Solving real-world optimal control problems are challenging tasks, as the system dynamics can be highly non-linear or including nonconvex objectives and constraints, while in some cases the dynamics are unknown, making it hard to numerically solve the optimal control actions. To deal with such modeling and computation challenges, in this paper, we integrate Neural Networks with the Pontryagin's Minimum Principle (PMP), and propose a computationally efficient framework NN-PMP. The resulting controller can be implemented for systems with unknown and complex dynamics. It can not only utilize the accurate surrogate models parameterized by neural networks, but also efficiently recover the optimality conditions along with the optimal action sequences via PMP conditions. A toy example on a nonlinear Martian Base operation along with a real-world lossy energy storage arbitrage example demonstrates our proposed NN-PMP is a general and versatile computation tool for finding optimal solutions. Compared with solutions provided by the numerical optimization solver with approximated linear dynamics, NN-PMP achieves more efficient system modeling and higher performance in terms of control objectives.
translated by 谷歌翻译
The task of reconstructing 3D human motion has wideranging applications. The gold standard Motion capture (MoCap) systems are accurate but inaccessible to the general public due to their cost, hardware and space constraints. In contrast, monocular human mesh recovery (HMR) methods are much more accessible than MoCap as they take single-view videos as inputs. Replacing the multi-view Mo- Cap systems with a monocular HMR method would break the current barriers to collecting accurate 3D motion thus making exciting applications like motion analysis and motiondriven animation accessible to the general public. However, performance of existing HMR methods degrade when the video contains challenging and dynamic motion that is not in existing MoCap datasets used for training. This reduces its appeal as dynamic motion is frequently the target in 3D motion recovery in the aforementioned applications. Our study aims to bridge the gap between monocular HMR and multi-view MoCap systems by leveraging information shared across multiple video instances of the same action. We introduce the Neural Motion (NeMo) field. It is optimized to represent the underlying 3D motions across a set of videos of the same action. Empirically, we show that NeMo can recover 3D motion in sports using videos from the Penn Action dataset, where NeMo outperforms existing HMR methods in terms of 2D keypoint detection. To further validate NeMo using 3D metrics, we collected a small MoCap dataset mimicking actions in Penn Action,and show that NeMo achieves better 3D reconstruction compared to various baselines.
translated by 谷歌翻译
A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image captioning, text-to-image generation, and vision-language representation learning. So far, research has focused on the relationships between images and text. For example, captioning models attempt to understand the semantics of images which are then transformed into text. An important question is: which annotation reflects best a deep understanding of image content? Similarly, given a text, what is the best image that can present the semantics of the text? In this work, we argue that the best text or caption for a given image is the text which would generate the image which is the most similar to that image. Likewise, the best image for a given text is the image that results in the caption which is best aligned with the original text. To this end, we propose a unified framework that includes both a text-to-image generative model and an image-to-text generative model. Extensive experiments validate our approach.
translated by 谷歌翻译
Model-based attacks can infer training data information from deep neural network models. These attacks heavily depend on the attacker's knowledge of the application domain, e.g., using it to determine the auxiliary data for model-inversion attacks. However, attackers may not know what the model is used for in practice. We propose a generative adversarial network (GAN) based method to explore likely or similar domains of a target model -- the model domain inference (MDI) attack. For a given target (classification) model, we assume that the attacker knows nothing but the input and output formats and can use the model to derive the prediction for any input in the desired form. Our basic idea is to use the target model to affect a GAN training process for a candidate domain's dataset that is easy to obtain. We find that the target model may distract the training procedure less if the domain is more similar to the target domain. We then measure the distraction level with the distance between GAN-generated datasets, which can be used to rank candidate domains for the target model. Our experiments show that the auxiliary dataset from an MDI top-ranked domain can effectively boost the result of model-inversion attacks.
translated by 谷歌翻译
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video dataset for fine-tuning. However, such paradigm is computationally expensive. Humans have the amazing ability to learn new visual concepts from just one single exemplar. We hereby study a new T2V generation problem$\unicode{x2014}$One-Shot Video Generation, where only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt the T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models are able to generate images that align well with the verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via an efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing temporally-coherent videos over various applications such as change of subject or background, attribute editing, style transfer, demonstrating the versatility and effectiveness of our method.
translated by 谷歌翻译